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Abstract 


We describe a hardware scheme to authenticate all 
or a part of untrusted external memory using trusted 
on-chip storage. Our scheme uses Merkle trees and 
caches to efficiently authenticate memory. Proper 
placement of Merkle tree checking and generation is 
critical to ensure good performance. Naive schemes 
where the Merkle tree machinery is placed between 
caches can result in a large increase in memory band- 
width usage. We integrate the Merkle tree machinery 
with one of the cache levels to significantly reduce 
memory bandwidth requirements. 

We present an evaluation of the area and perfor- 
mance costs of various schemes using simulation. For 
most benchmarks, the performance overhead of au- 
thentication using our integrated Merkle tree/caching 
scheme is less than 25%, whereas the overhead of au- 
thentication for a naive scheme can be as large as 
10x. We explore tradeoffs between external memory 
overhead and processor performance. 


Introduction 


Secure processors (e.g., [Yee94] [SW99], [LTMt00]) 
can enable many applications such as copy-proof soft- 
ware and certification that a computation was carried 
out correctly. Additionally they hold promise in help- 
ing to implement secure computing systems that are 
resistant to viruses and other forms of attacks [CPL]. 

Even if secure processors are built, in order to build 
secure computing systems, one of the problems that 
needs to be solved is the authentication of untrusted 
external memory. This memory is typically accessed 
via an external memory bus, or may be networked in 
some fashion to the secure processor. While the se- 
cure processor is running, an adversary may corrupt 
the memory values in the external memory. The se- 
cure processor should be able to detect any form of 
memory corruption. Typically, the secure processor 
will cease the execution of the running task(s) when 
it detects memory corruption.! To achieve security 


lIn this paper, we are not concerned with the correction of 


and high performance, it is critical to have an effi- 
cient authentication scheme that does not slow down 
the processor. In this paper, we describe hardware 
schemes to efficiently authenticate all or a part of 
untrusted external memory using a limited amount 
of trusted on-chip storage. 

We propose a on-line scheme for memory authen- 
tication. Our scheme uses hash trees (also called 
Merkle trees) and caches to efficiently authenticate 
memory. Naive schemes where the Merkle tree ma- 
chinery is placed between caches, e.g., between L1 
and L2 caches, can result in a factor of log N in- 
crease in memory bandwidth usage (where N is the 
memory size), thereby degrading performance signif- 
icantly. In our proposed scheme, we integrate the 
Merkle tree machinery with one of the cache levels 
to significantly reduce memory bandwidth require- 
ments. 

We present an evaluation of the area and perfor- 
mance costs of various on-line schemes using simula- 
tion. For most benchmarks, on a superscalar proces- 
sor, the performance overhead of authentication us- 
ing our integrated Merkle tree/caching scheme is less 
than 25%, whereas the overhead of authentication 
for a nave scheme can be as large as 10x. We show 
tradeoffs between external memory overhead and se- 
cure processor performance. 

We describe related work in Section 1. The as- 
sumed model is presented in Section 2, and moti- 
vating applications are the subject of Section 3. An 
on-line caching scheme for memory authentication is 
described in Section 4. We evaluate the scheme im- 
plemented on a superscalar processor in Section 5. 


1 Related Work 


Merkle trees [Mer80] were proposed as a means to 
update and validate data hashes efficiently by main- 
taining a tree of hash values over the objects. 


corrupted data in memory. An adversary may be able to de- 
stroy the entire contents of memory causing an unrecoverable 
error. Thus, the secure processors we consider do not protect 
against denial-of-service attacks. 


Blum et al. addressed the problem of securing vari- 
ous data structures in untrusted memory using a hash 
tree rooted in trusted memory [BEG*91]. They con- 
sider both off-line detectors that use a hashed trace 
of all accesses (including values read or written, loca- 
tions accessed and times of access) and an on-line de- 
tector, which uses Merkle trees. The on-line scheme 
described by Blum et al. has a O(log(N)) cost for 
each memory access. The off-line scheme targets 
the detection of memory errors, and does not secure 
memory against attacks from an adversary. 

Maheshwari, Vingralek and Shapiro use Merkle 
trees to build trusted databases on top of trusted 
storage [MVSO00]. This work is similar to ours in 
that trusted memory can be viewed as a cache for 
untrusted disk storage — their scheme exploits mem- 
ory locality to reduce disk bandwidth. Our work ad- 
dresses the issues in implementing Merkle tree ma- 
chinery in hardware and integrating this machinery 
with an on-chip cache to reduce the log N memory 
bandwidth overhead. The caching algorithm of Sec- 
tion 4 is more general in that a single hash can be 
used for multiple cache blocks. This scheme can po- 
tentially reduce untrusted memory size overhead and 
cache pollution without increasing cache block size. 

Shapiro and Vingralek [SV01] address the prob- 
lem of managing persistent state in DRM systems. 
They assume that each memory reference results in a 
MAC computation and therefore discount the possi- 
bility of securing volatile storage because it requires 
large overheads. They assume that volatile memory 
is inside the security perimeter. 

In [DS02] allusions are made to a smartcard system 
that would use a Merkle tree with large pages of RAM 
at its leaves, combined with caching of pages in in- 
ternal memory. Their discussion, however is strongly 
directed towards smartcard applications, and they do 
not appear to consider caching nodes of the Merkle 
tree. 


2 Model 


In this paper we are considering a computer system 
with the following properties: 


e There is a high performance processor that con- 
tains a secret that allows it to produce keys to 
perform cryptographic operations such as sign- 
ing or encrypting that no other processor could 
do for it. This secret can be a private key from a 
public key pair as in XOM [LTM™00], or it can 
be a Physical Unknown Function [GCvDD02]. 
Symmetric key schemes are inapropriate as we 
want many mutually mistrusting principals to 
be able to use the system. 


e Operations that take place inside the processor 
are assumed to be private and tamper-evident. 


e The processor has sufficient on-chip storage to 
perform its cryptographic operations on-chip. 


e The processor has a trusted on-chip cache. 


e Everything outside the processor is untrusted, in 
particular the memory. By untrusted, we mean 
that there is an adversary who knows everything 
about the system except for the processor’s se- 
cret. 


The objective of the system is the following: 


e A user wants to use the system to perform a com- 
putation that produces one or many results, to 
which cryptographic primitives are then applied. 
The reason for the cryptographic primitives will 
be illustrated in section 3.1. 


e The computation will involve the processor and 
external memory. 


e We want the computation to be carried out at 
speeds that are as close as possible to the speed 
of a conventional insecure processor. 


e We want a high probability of detecting results 
that contain errors induced by tampering from 
the adversary. This probability must be high 
even if the adversary chooses the program to run. 


The objective of the adversary is the following: 


e The adversary wants to tamper with the mem- 
ory in such a way that the system produces an 
incorrect result that looks correct to the user. 


In this paper we will solve this problem by pro- 
viding an authentication mechanism for the off-chip 
memory. In the next section we show how this in- 
tegrity can be used in applications. 


3 Applications 


3.1 Certifying the Execution of a Pro- 
gram 


The main focus of this paper is to explore memory 
authentication. However, memory authentication is 
only useful if it is performed on a processor that con- 
tains a secret, and the processor is able to perform 
some simple cryptography on it. We will not go into 
the details of the cryptography that is necessary, but 
for ease of understanding, we provide an example of 
use. 

Alice has a problem to solve that requires a lot of 
computing power. Bob has a computer that is idle, 
and that he is willing to rent to Alice. If Alice gives 


Bob her problem to execute, and Bob gives her a 
result, how can she be sure that Bob actually carried 
out the computation? How can she tell that Bob 
didn’t just invent the result? 

One way of solving this is by having a processor 
that has been certified by its manufacturer, that con- 
tains a secret, and that is equipped to deal with prob- 
lems such as the one Alice has. 

Alice sends this processor her problem expressed 
as a program. The processor uses Alice’s program 
combined with its secret key through a collision re- 
sistant scheme to produce a key that is unique to 
the processor-program pair. The processor then exe- 
cutes Alice’s program without allowing any interfer- 
ence from external sources. The processor executes 
the program in a deterministic way to produce the 
result Alice desires. It then uses the key it generated 
to sign the result before sending it to Alice. 

As long as Alice’s computation can all be done on 
the processor, things go well. However, for most algo- 
rithms, it is likely that Alice will need to use external 
memory. How can she be sure that Bob isn’t tam- 
pering with the memory bus to make Alice’s program 
terminate early while still producing a valid certificate 
for an incorrect result? This is precisely the question 
that we try to answer in this paper by providing an 
efficient means to authenticate memory operations. 

When Alice receives the signed result, she is able to 
check it. At that point she knows that her program 
was executed on a trusted processor, and that the ex- 
ternal memory performed correctly.” If the program 
did not contain a bug then Alice has the correct re- 
sult. 

Without the ability to perform some kind of cryp- 
tography, the memory authentication would be use- 
less except to detect faults in the memory (which 
could be detected much more cheaply with simple er- 
ror detecting codes). Indeed, executions carried out 
on the real processor would be identical to results 
carried out on a processor simulator, on which any 
kind of tampering can be done with the data. 

Of course, in real systems Bob will want to con- 
tinue using his computer while Alice is calculating. 
Systems like Palladium or XOM provide this func- 
tionality as we shall see in the following sections. But 
in each case, they still need authenticated memory. 


3.2 Palladium 


Microsoft’s proposed security model, Palladium 
[CPL], may be enhanced by authenticated memory. 
Indeed, Palladium works by providing a mechanism 
whereby the Operating System (OS) can prove to an 


2If Alice’s program stored data on disk, we assume that it 
took measures to authenticate the data. 


application that it is trusted, and that it was loaded 
properly (so no mischievous code was able to load be- 
fore the OS). The application program, knowing that 
it is being protected by a security kernel that was 
properly loaded, can be confident that it will receive 
whatever protection it requires for proper execution. 

But in order to make this model work, it has to be 
impossible for a hacker to break the system’s security 
by modifying the secure kernel as it goes over the 
bus between external memory and processor. This 
can be done by always keeping the security kernel on 
the processor. But this wastes on-chip storage when 
the security kernel is not in use (most of the time). 
Another option is to use authenticated memory such 
as we describe it in this paper. 


3.3. XOM architecture 


The eXecute Only Memory (XOM) architecture 
[LTM™t00] is designed to run security requiring appli- 
cations in secure compartments that can only com- 
municate with the rest of the world on an explicit 
request from the application. 

This protection is achieved on-chip by tagging data 
with the compartment to which it belongs. In this 
way, if a program executing in a different compart- 
ment attempts to read the data, the processor detects 
it and raises an exception. 

For data that goes off-chip, XOM uses encryption 
to preserve privacy. Each compartment has a differ- 
ent encryption key. Before encryption, the data is 
appended with a hash of itself. In this way, when 
data is recovered from memory, XOM can verify that 
the data was indeed stored by a program in the same 
compartment. XOM prevents an adversary from 
copying encrypted blocks from one address to another 
by combining the address into the hash of the data 
that it calculates. 


3.3.1 Exploiting Replay Attacks 


However, XOM’s integrity mechanism is vulnerable 
to replay attacks, which was also pointed out in 
[SVO1]. Indeed, in XOM there is no way to detect 
whether data in external memory is fresh or not. 
Freshness appears to be provided for from one exe- 
cution of a secure application to another by having a 
mutating key (essentially, a different key for each ex- 
ecution). Within a single execution of the secure ap- 
plication, however, the key cannot be changed with- 
out making data that was stored at the beginning of 
execution unreadable. Therefore, an adversary can 
do replay attacks by having the memory return stale 
data that had previously been stored at the same 
address during the same execution. In particular, 


XOM will not notice if writes to memory are never 
performed except when memory is first initialized. 

This flaw in XOM’s authentication could be ex- 
ploited to violate the privacy of some programs. Con- 
sider the following example: 


for (i = 0; i < size; i++) 
{ 
outputdata(*datat++) ; 
/* outputdata copies data 
out of the secure 
compartment */ 


If outputdata causes i to be swapped to mem- 
ory, and if i and data are not in the same cache 
line, then an attacker can cause the loop to be exe- 
cuted many more times than it should. Assuming the 
attacker knows where i is stored, he can record the 
value of i during an iteration of the loop, and then re- 
place the incremented value by the pre-recorded value 
each time through the loop. In this way, outputdata 
gets called with each data value up to the end of the 
data segment, thus revealing a lot more to the out- 
side world than initially intended. If data is stored 
on the stack, it might be possible to replace it with an 
address in the code segment to reveal a large portion 
of the program’s code. 

To pull this attack off, the adversary would pre- 
sumably single step the program,® flushing the cache 
between steps. In this way, he can obtain a cache- 
line level observation of the program’s memory ac- 
cess patterns. By observing loop like patterns in the 
program counter, the adversary can search for loops. 
Loops that cause data to be copied out of the secure 
compartment can be identified by the unencrypted 
data that they are writing to memory, or, even bet- 
ter, by the unprotected system calls that are being 
called with the data.*+ All the adversary has left to 
do is guess the location of i in the stack (the gen- 
eral position of the stack will be apparent from the 
memory access pattern). 

Though this attack may seem involved, and despite 
the fact that the code sample is somewhat unlikely, it 
is quite plausible that a complex program will contain 
similar vulnerabilities, which a motivated adversary 
could find and exploit. There is a wealth of examples 
from the smartcard world where attacks of similar 
type have been carried out to extract secret informa- 
tion, as can be seen in [AK97]. 


3If single stepping is forbidden, a similar effect can probably 
be obtained by generating an interrupt after a small number 
of memory accesses. With luck, it would even be possible to 
turn off the processor’s caches. 

4In fact, it might be possible to find a suitable loop simply 
by observing patterns of system calls. 


3.3.2 Correcting XOM 


XOM can be fixed in a simple, though not optimal, 
way by combining it with our memory authentica- 
tion method. Essentially, XOM was attempting to 
provide two forms of protection: protection from an 
untrusted OS, and protection from untrusted off-chip 
memory. Its method for dealing with the untrusted 
OS seems much more open an approach than the one 
that is used by Palladium. It would make all the Pal- 
ladium applications possible without forcing the use 
of a certified (and presumably commercial) OS. As 
far as the protection of off-chip memory goes, XOM 
fails because memory is not properly authenticated. 
Protecting memory integrity with hash trees would 
solve XOM’s problem. 


4 Authentication Algorithm 
4.1 Hash Trees 


We verify the integrity of memory with a hash tree, 
also called a Merkle tree (see [Mer80]). In a hash 
tree, data is located at the leaves of a tree. Each 
node contains a collision resistant hash of the data 
that is in each one of the nodes or leaves that are 
below it. A hash of the root of the tree is stored in 
secure memory where it cannot be tampered with. 
Figure 1 shows the layout. 
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Figure 1: A hash tree. Chunks contain hashes that 
authenticate data in chunks lower in the tree. 


To check that a node or leaf in a hash tree has not 
been tampered with, we check that its hash matches 
the hash that is stored in its parent node, and that 
the parent node has not been tampered with. We 
repeat this process on the parent node, and on its 
parent node, all the way to the root of the tree that 
has its hash stored in secure memory. When writing 


to the hash tree, it is similarly necessary to update 
all the nodes above the modification up to the secure 
root hash. 

An m-ary hash tree allows integrity verification 
with a constant factor overhead in memory consump- 
tion of m/(m—1). With a balanced tree, the number 
of hash checks to perform is log,,(V), where N is 
the amount of memory to be protected, expressed in 
multiples of the size of a hash. The cost of each hash 
computation is proportional to m (i.e., the amount 
of data to hash). 

These costs of using a hash tree are quite modest 
considering that they allow a very small amount of 
secure memory (typically 128 to 160 bits) to verify 
the integrity of arbitrarily large amounts of memory. 
For a 4-ary tree, one quarter of memory ends up be- 
ing used by hashes, which is large, but not unaccept- 
ably so. However memory bandwidth is a major con- 
straint in high performance systems, and the number 
of hashes that must be read from memory increases 
very quickly with memory size and can easily exceed 
10. Unless this access pattern is optimized, system 
performance is sure to be dismal. 


4.2 Hash Trees in the Memory Hier- 
archy 


Proper placement of the hash tree checking and gen- 
eration machinery is critical in ensuring good per- 
formance. On first thought, the machinery could be 
placed between two layers of the memory hierarchy. 
The higher layers would not know about the hash 
tree. On a miss, they would use the hash tree ma- 
chinery to read and authenticate data from the lower 
part of the hierarchy. There are two logical places for 
these operations to be carried out: 


1. One possibility is to place the hash tree machin- 
ery between L2° and external memory. This 
way cached data can be retrieved very fast with- 
out having to go through any integrity verifica- 
tion. However, this option implies that L2 cache 
misses will result in a whole path from leaf to 
root of the hash tree being fetched from external 
memory on a cache miss, which is unacceptably 
slow. 


2. Another possibility is to place the hash tree ma- 
chinery between L1 and L2. Thus, frequently 
used paths in the hash tree will end up being 
cached in L2 and will not put such a high load 
on memory bandwidth. The drawback is that ac- 
cessing L2 becomes much more costly since each 
access to L2 involves checking a whole path in 
L2. 


5We are assuming a processor with on-chip L1 and L2 cache. 


Both schemes share a common drawback. Each 
access that goes through the hash tree layer requires 
log,,, IV) hash checks, and consequently accessing all 
the data that those hashes are to be computed on. In 
Section 5 we use scheme 1 as a representative naive 
scheme, and refer to it as naive. The following sec- 
tion shows an optimized hash tree implementation 
that gets around this problem and which makes hash 
trees a reasonable choice in a high performance pro- 
cessor. 


4.3. Making Hash Trees Fast: chash 


To make hash trees fast, we have to merge the hash 
tree machinery with one of the cache levels. Values 
that are stored in this combined cache are trusted, 
which allows accesses to be performed directly on 
cached values without any hashing, giving us the ad- 
vantage of scheme 1 above. At the same time, data 
that is needed for hashing can now be read from the 
cache. This reduces the latency to the data as in 
scheme 2, and has the additional advantage that if a 
hash comes from the cache, it is trusted, and there- 
fore it is not necessary to continue checking hashes 
up to the root of the tree. 

The following algorithms show how the combined 
cache that has all these nice properties can be imple- 
mented. In these algorithms the word cache refers to 
the combined cache (which is assumed to be trusted), 
and the word memory refers to the next level in 
the memory hierarchy.° The memory is divided into 
chunks that are the basic unit that hashes are com- 
puted on. For now we will consider that chunks coni- 
cide with cache blocks. 

ReadAndCheckChunk: Reads data from exter- 
nal memory and checks it. 


1. Read the chunk from memory. 


2. Return the chunk to the caller so that it can start 
speculative execution. 


3. Start hashing the chunk that we just read. In 
parallel, recursively call ReadAndCheck to fetch 
the chunk’s hash from its parent chunk. If the 
chunk is in fact the root chunk, its hash is fetched 
directly from secure memory instead of calling 
Read AndCheck. 


4. Compare the hash we just computed with the 
one in the parent chunk. If they do not match, 
raise an exception. 


ReadAndCheck: Called when the processor ex- 
ecutes a read instruction. 


6We will work with L2 as the cache, and off-chip RAM as 
the memory. 


1. If the data is cached, return the cached data. We 
are done. 


2. Call ReadAndCheckChunk on the data’s chunk. 
3. Put the read chunk into the cache. 
4. Return the requested data. 


Write: Called when the processor executes a write 
instruction. 


1. If the data to be modified is in the cache, modify 
it directly. We are done. 


2. Otherwise, use ReadAndCheckChunk to get the 
chunk data, and put it into the cache (we are 
implementing a write-allocate cache here). 


3. Modify the data in the cache. 


Write-Back: Called when a dirty cache block is 
evicted. 


1. Compute the hash on the modified chunk. 


2. In a way that makes both changes visible si- 
multaneously, write the chunk to memory and 
change its hash in the parent chunk using the 
Write operation described above (unless it is the 
root chunk, in which case the hash is stored in 
secure memory). 


Intuitively, with this algorithm, when a node of 
the hash tree is loaded into the cache, it is used as 
the root of a new hash tree. This is valid because 
the node is now stored in secure on-chip storage, and 
thus no longer needs to be protected by its parent 
node in the main hash tree. The performance ad- 
vantage results because the new tree is smaller than 
the original tree, which reduces the path from leaf to 
node. As far as correctness goes, the algorithm’s es- 
sential invariant is that at any time, a hash contained 
in the tree is a hash of the child chunk in memory.” 
On writes, the hash only gets recomputed when the 
changes are written back. 

Note that this algorithm implements a write- 
allocate cache. This is sensible since performing a 
word write requires the word’s whole chunk to be 
read in for hashing anyways. Nevertheless, a useful 
optimization can be made, inspired by normal cache 
technology: if write allocation simply marks unwrit- 
ten words as invalid rather than loading them form 
memory, then chunks that get entirely overwritten 
don’t have to be read from memory and checked. 
This optimization eliminates one chunk read from 
memory and one hash computation. 


7 This invariant is in fact a bit too strong for this algorithm, 
but will be necessary for the versions that are described in 
the next sections. We could reduce the invariant to: hashes 
of uncached chunks must be valid, hashes of cached chunks 
can have an arbitrary value. The last step of the write-back 
algorithm can then take place in two steps: update the hash, 
then write the hash back to memory. 


4.4 Multiple Cache Blocks 


Chunk: mhash 


per 


In the algorithm described above, we have assumed 
that there is exactly one cache block per chunk. Since 
the cache block is usually chosen to optimize the per- 
formance of the processor when security is turned off, 
it turns out that the chunk size is completely con- 
strained before the memory integrity option is even 
considered. To allow more flexible selection of the 
chunk size, let us consider an improved algorithm 
that does not require that chunks coincide with cache 
blocks. 

The modified algorithm is described below. Parts 
that are unchanged appear in small type. Note that 
ReadAndCheckChunk returns the data that is in 
memory. This data will be stale when the cache con- 
tains a dirty copy of some cache blocks. 

Read AndCheckChunk 


1. Read cache blocks that are clean in the cache 
directly from the cache. Read the rest of the 
chunk from memory. 


2. Return the chunk to the caller so that it can start speculative 
execution. 


3. Start hashing the chunk that we just read. In parallel, re- 
cursively call ReadAndCheck to fetch the chunk’s hash from 
its parent chunk. If the chunk is in fact the root chunk, 
its hash is fetched directly from secure memory instead of 
calling ReadAndCheck. 


4. Compare the hash we just computed with the one in the 
parent chunk. If they do not match, raise an exception. 


Read AndCheck 

1. If the data is cached, return the cached data. We are done. 

2. Call ReadAndCheckChunk on the data’s chunk. 

3. Put the read chunk into the cache, except for cache 
blocks that are already cached in the dirty state. 

4. Return the requested data. 


Write 


1. If the data to be modified is in the cache, modify it directly. 
We are done. 


2. Otherwise, use ReadAndCheckChunk to get 
blocks that are missing from the cache. Write 
them to the cache (we are implementing a write- 
allocate cache here). 

3. Modify the data in the cache. 


Write-Back 


1. If the chunk is not entirely contained in the 
cache, use ReadAndCheckChunk to get the miss- 
ing data. 


2. Mark all the chunk’s cached blocks as clean. 
3. Compute the hash on the modified chunk. 


4. Ina way that makes both changes visible simultaneously, 
write the blocks that were dirty to memory and 
change its hash in the parent chunk using the Write operation 
described above (unless it is the root chunk, in which case 
the hash is stored in secure memory). 


This algorithm can be further optimized by replac- 
ing the hash function by an incremental MAC (Mes- 
sage Authentication Code). This MAC has the prop- 
erty that single cache block changes can be applied 
without knowing the value in the other cache blocks. 
An example of such a MAC is presented in [BGR95], 
it is based on a conventional MAC function hy, and 
an encryption function Ex: 


Mae (™1,°++, Mn) = Ex (he(1,m1)@--- he (n, Mn)) 


Given a value of the MAC, it can be updated 
when m,; changes by decrypting the value, subtract- 
ing the old value of hy(i,m,), adding the new value 
of hy (2, m,;), and finally encrypting the new result. 

With this hash function, the Write-Back operation 
can be optimized so that it is not necessary to load 
the whole chunk from memory if part of it isn’t in 
the cache. 

Write-Back 


1. Read the parent MAC using the ReadAndCheck 
operation. 


2. Read the old value of the cache block from mem- 
ory directly (without checking it so that we don’t 
have to read the whole chunk). 


3. Calculate the new value of the MAC by doing an 
update. 


4. In a way that makes both changes visible simultaneously, 
write the chunk to memory and change its hash in the parent 
chunk using the Write operation described above (unless it 
is the root chunk, in which case the hash is stored in secure 
memory). 


We will refer to the optimized algorithm as mhash 
in Section 5. 


4.5 Simplified Memory Organization 


We have chosen to adopt a very simple memory or- 
ganization in which all of physical memory is authen- 
ticated. Physical memory is assumed to be present 
as a contiguous segment beginning at address 0 that 
we want to authenticate completely. While this as- 
sumption is quite restrictive as far as real systems go, 
it is quite adequate for our purposes of studying the 
performance cost of protecting RAM with hash trees. 

The layout of the hash tree in RAM is equally sim- 
ple. The memory is stored in equal sized chunks. 
Each chunk can store data or can store m hashes. 
Chunk are numbered consecutively starting from zero 
so that a chunk’s number multiplied by the size of the 
chunk produces the chunk’s starting address. 

To find the parent of a chunk, we subtract one 
from the chunk’s number divided by m and round 
down. If the result is negative then the chunk’s hash 
is stored in secure memory. Otherwise, the result is 


the parent chunk’s address. The remainder of the 
division indicates the index of the chunk’s hash in its 
parent chunk. 

The resulting tree is almost a balanced m-ary tree. 
In general, the m balanced subtrees aren’t quite bal- 
anced as there aren’t enough elements to fill the last 
level completely. 

The interesting features of this layout are that it is 
very easy to find a chunk’s parent when m is a power 
of two, and all the leaves are contiguous. 


4.6 Real Life Issues 
4.6.1 Direct Memory Access 


For now we have considered that all of memory is 
authenticated. This assumption breaks down when 
data is inserted directly into memory by a device 
through Direct Memory Access (DMA). 

One way of dealing with this is to set aside an 
unprotected area of memory for use in DMA trans- 
fers. Once the transfer is done, programs can copy 
the data into a secure buffer before using it. This 
method has the drawback of requiring an extra copy 
of the data. But in many systems this is not a major 
penalty as the operating system typically performs a 
copy of incoming data from a kernel buffer into user 
space anyways. 

Data coming from the DMA is untrusted (since it 
comes from off chip), so once it has been brought 
into authenticated memory, it must be authenticated 
by the application program using some scheme of its 
choosing. 

For safety, the processor should only allow reads 
to unprotected memory when a special ReadWith- 
outChecking instruction is used. That way a program 
cannot be tricked into reading unauthenticated data 
when it expects authenticated data. 


4.6.2 Initialization 


So far we have considered how the processor executes 
when memory is authenticated. It is important to 
consider how to initialize secure mode since a flaw in 
this step would make all our efforts futile. Here is the 
proposed procedure: 


1. Turn on hashing, but for now do not 
bother checking hashes during the ReadAnd- 
CheckChunk. In this mode hash trees will be 
computed, but no checking is done. 


2. Touch (write to) each chunk that is to be covered 
by the hash tree. In this way each chunk ends 
up in the cache in a dirty state. As chunks are 
written back, higher levels of the hash tree will 
get updated. 


3. Flush the cache. This forces all the dirty chunks 
to be written back to memory. These write- 
backs will cause their parent nodes to appear 
in the cache in a dirty state. The parents will 
in turn be written back to the cache, and so on 
until the whole tree has been computed.® ® 


4. Turn on the memory authentication failure ex- 
ceptions. 


5. Generate the key that will be used by this 
program for cryptographic purposes (see Sec- 
tion 3.1). 


At this point, the program is running in secure 
mode, and its key has been generated. It can now run 
and eventually sign its results, unless tampering takes 
place resulting in the destruction of the program’s 
key. 


4.7 Precise Exceptions 


In the algorithms presented above, it has been stated 
that hash checks related with ReadChunk can be 
completed in the background while the returned value 
is used speculatively. The question is: how far can 
execution continue speculatively? 

Since there is no general way of recovering from 
tampering other than restarting the program execu- 
tion from scratch, it appears that there is no need to 
make memory authentication failure exceptions pre- 
cise. Therefore, execution can proceed without wait- 
ing for authentication checks to complete, and spec- 
ulative instructions can commit. 

There is however an exception to this rule as far 
as cryptographic operations go. We saw in section 2 
that the program can perform cryptographic opera- 
tions using a key that is a function of the running pro- 
gram. These operations must not allow their results 
to be seen outside the processor before all preced- 
ing hash checks have passed. Otherwise an adversary 
would be able to make a change to some data just 
before a program performs a cryptographic primitive 
on it. With luck, the result of the operation could be 
sent off-chip before the hash checks completed, and 
the adversary would have tricked the system into ap- 
plying its cryptography on data to which it shouldn’t 


8In fact, with this procedure, each hash might be computed 
a number of times that is equal to the arity of the tree. The 
procedure that is described here could be optimized to produce 
only one computation of each hash, but this would require 
added assumptions about the instruction set architecture to 
describe precisely, and would not impact the security of the 
scheme. 

°In the case where incremental MACs are used, all MAC 
computations are incremental. So this cache flushing trick 
would not work. Therefore, the initialization must be mod- 
ified so that it actually computes a MAC from scratch. 


have been applied. This attack would allow the ad- 
versary to sign, encrypt or decrypt a message of his 
choice even though he does not have the key. 

Therefore, cryptographic instructions must act as 
barriers for speculative execution of instructions that 
rely on unauthenticated data. They will stall at the 
commit stage until all hash checks have completed. 
For debugging purposes, it might be desirable to pro- 
vide a mode in which all instructions behave as bar- 
riers. 


5 Evaluation 


This section evaluates our memory authentication 
scheme using a processor simulator. First, we de- 
scribe the hardware implementation of our algorithm 
and the additional logic required by our scheme. 
Next, the simulation framework used for the experi- 
ments is described. Performance and memory space 
overheads of the scheme are then discussed. We also 
study the effects of various architectural parameters 
on memory authentication performance. Finally, the 
incremental MAC based method to reduce the mem- 
ory overhead is evaluated. 


5.1 Hardware Implementation 


We describe the implementation of the chash scheme. 
The mhash scheme uses the same datapaths but re- 
quires additional control. 

A hash checking/generating unit is added next to 
the L2 cache. Whenever there is a L2 cache miss, a 
new cache block is read from the main memory, and 
added to the hash read buffer unit which checks au- 
thenticity (Figure 2 (a)). The hashing unit computes 
a hash of the new cache block, and compares with 
a previously stored hash, which is read from the L2 
cache (or a root hash register if the hash happens to 
be the root of the tree). If two hashes do not match 
each other, a security exception is raised. 

Similarly, when a cache block gets evicted from the 
L2 cache, it is stored in the hash write buffer unit 
while writing back the block to the main memory 
(Figure 2 (b)). The hash unit computes a new hash 
of the evicted block and stores the hash back into the 
L2 cache. 


5.2 Logic Overhead 


To evaluate the cost of computing hashes, we consid- 
ered the MD5 [Riv92] (and SHA-1 [DEJO1]) hashing 
algorithms. The core of each algorithm is an opera- 
tion that takes a 512-bit block, and produces a 128- 
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Figure 2: Hardware implementation of the chash scheme. (a) L2 cache miss: read from the memory. (b) L2 


write back: write to the memory. 


bit (or 160-bit, respectively) digest.1° 

In each case, simple 32-bit operations are per- 
formed over 80 rounds. In each round there are 2 
to 4 logic levels, as well as 2 adders. We noted that 
with suitable skewing of the adders, rounds can be 
performed in one cycle per round on average. 

The total amount of 32-bit logic blocks that is re- 
quired for the 80 rounds is 260 adders, 32 multiplex- 
ers, 16 inverters, 16 or gates and 48 xor gates (for 
SHA-1, 325 adders, 60 and gates, 40 or gates, 20 
multiplexers and 272 xor gates). If these were all laid 
out, we would therefore need on the order of 50,000 
1-bit gates altogether. In fact, the rounds are very 
similar to each other so it should be possible to have 
a lot of sharing between them. To exploit this we 
chose a hash throughput of one per 20 cycles. This 
should allow the circuit size to be divided by a factor 
of 10 to 20. 


5.3. Simulation Framework 


Our simulation framework is based on the Sim- 
pleScalar tool set [BA97], which models speculative 
out-of-order execution. To model the memory band- 
width usage more accurately, separate address and 
data buses were implemented. All structures that 
access the main memory including a L2 cache and 
the hash unit share the same bus. 

The architectural parameters used in the simula- 
tions are shown in Table 1. SimpleScalar is config- 
ured to execute Alpha binaries, and all benchmarks 


10Tn fact, for variable length messages, the output from the 
previous 512-bit block is used as an input to the function that 
digests the next 512-bit block. Since we are dealing with fixed- 
length messages of less than 512 bits, we do not need this. 


Architectural parameters Specifications 
Clock frequency 1 GHz 
L1 I-caches 64KB, 2-way, 32B line 
L1 D-caches 64KB, 2-way, 32B line 
L2 caches Unified, IMB, 4-way, 64B line 
Lil latency 2 cycles 
L2 latency 10 cycles 
Memory latency (first chunk) 80 cycles 
I/D TLBs 4-way, 128-entries 
Memory bus 200 MHz, 8-B wide (1.6 GB/s) 
Fetch/decode width 4 / 4 per cycle 
issue/commit width 4/4 per cycle 
Load/store queue size 64 
Register update unit size 128 
Hash latency 80 cycles 
Hash throughput 3.2 GB/s 
Hash read/write buffer 16 
Hash length 128 bits 


Table 1: Architectural parameters used in simula- 
tions. 


are compiled on EV6 (21264) for peak performance. 

For all the experiments in this section, nine 
SPEC2000 CPU benchmarks [Hen00] are used as 
representative applications: gcc, gzip, mcf, twolf, 
vortex, vpr, applu, art, and swim. These bench- 
marks show varied characteristics such as the level 
of ILP (instruction level parallelism), cache miss- 
rates, etc. By simulating these benchmarks, we can 
study the impact of memory authentication on vari- 
ous types of applications. 

To capture the characteristics of benchmarks in the 
middle of computation, each benchmark is simulated 
for 100 million instructions after skipping the first 1.5 
billion instructions. In the simulations, we ignore the 
initialization overhead of the hash tree. Given the 
fact that benchmarks run for a long time, the over- 
head should be negligible compared to the steady- 


state performance. 


5.4 Performance Impact of Memory 
Authentication 


On-line memory authentication requires computing 
and checking a hash for every read from off-chip mem- 
ory. At the same time, a new hash should be com- 
puted and stored on a write-back to memory. Mem- 
ory authentication implies even more work for mem- 
ory operations, which already are rather expensive. 
Therefore, the obvious first concern of memory au- 
thentication is its impact on application performance. 

Fortunately, computing and checking hashes do not 
always increase memory latency. We can optimisti- 
cally continue computation as soon as data arrives 
from the memory while checking their authenticity 
in the background. Checking the authenticity of data 
hurts memory latency only when read/write buffers 
are full. 

Authenticating memory traffic, however, can de- 
grade the memory performance in two ways: L2 cache 
pollution and memory bandwidth pollution. First, 
if we cache hashes in the L2 cache, hashes contend 
with regular application data and can degrade the 
L2 miss-rate for application data. On the other hand, 
loading and storing hashes from/to the main mem- 
ory increases the memory bandwidth usage, and may 
steal bandwidth from applications. 

Figure 3 illustrates the impact of memory authenti- 
cation on application performance. For six different 
L2 cache configurations, the IPCs (instructions per 
cycle) of three schemes are shown: a standard proces- 
sor (base), memory authentication with caching the 
hashes with a single cache block per chunk (chash), 
and memory authentication without caching (naive). 

The figure first demonstrates that the performance 
overhead of memory authentication can be surpris- 
ingly low if we cache hashes. Even though the on-line 
memory authentication algorithm based on a hash 
tree can cause tens of additional memory accesses per 
L2 cache miss, the performance degradation of chash 
compared to base is less than 50% in the worst case 
(mcf in the 64B, 256KB case). Moreover, the perfor- 
mance degradation decreases rapidly as either the L2 
cache size or the block size increases. For a 4-MB L2 
cache, all nine benchmarks run with less than 20% 
performance hit. 

The importance of caching the hashes is also clearly 
shown in the figure. Without caching (naive), some 
programs can be slowed down by factor of ten in the 
worst case (swim and applu). In the case of the naive 
scheme, even increasing the cache size or the cache 
block size does not reduce the overhead. For example, 
app1lu is still ten times slower than the base case with 


10 


i base-256K @ chash-256K 0 base-4M 1 chash-4M 


30 


L2 miss-rate (%) 


mcf 


gcc 


gzip twolf vortex vpr app 


Figure 4: L2 cache miss-rates of program data for a 
standard processor (base) and memory authentica- 
tion with caching (chash). The results are shown for 
256-KB and 4-MB caches with 64-B cache blocks. 


a 64-B, 4-MB L2 cache. 

Finally, Figure 3 shows the effect of changing the 
L2 cache size and the L2 block size on the perfor- 
mance. Having a larger L2 cache reduces authen- 
tication performance since it reduces the number of 
off-chip accesses. A large L2 cache is likely to result 
in better hash hit-rate without hurting application 
hit-rate. Having a larger L2 block also reduces the 
overhead of memory authentication by having less 
levels in the hash tree. However, a non-optimal L2 
block size can degrade the baseline performance as 
shown in Figure 3. 

In the following subsections, we discusses the per- 
formance considerations of memory authentication in 
more detail. 


5.4.1 Cache Contention 


Since we cache hashes sharing the same L2 cache with 
a program executing on a processor, both hashes and 
application data contend for L2 cache space. This can 
increase the L2 miss-rate for a program and degrade 
the performance. 

The effect of cache contention is studied in Fig- 
ure 4. The figure depicts the L2 miss-rates of 
the baseline case and memory authentication with 
caching. As shown, for a small L2 cache, the miss- 
rate can be noticeably increased by caching the 
hashes. In fact, cache contention is the major source 
of performance degradation for twolf, vortex, and 
vpr. However, as the L2 cache size increases, cache 
contention is alleviated. For example, with a 4-MB 
L2 cache, none of the benchmarks show noticeable L2 
miss-rate degradation. We note that increasing the 
L2 block size (block = chunk) alleviates cache con- 
tention by reducing the number of hashes to cover a 
given memory space (not shown). 
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Figure 3: IPC comparison of three different schemes for various L2 cache configurations: standard proces- 
sors without memory authentication (base), memory authentication with caching the hashes (chash), and 
memory authentication without caching hashes (naive). Results are shown for different cache sizes (256KB, 
1MB, 4MB) and different cache block sizes (64B, 128B). 
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5.4.2 Bandwidth Pollution 


Another major concern of the memory authentication 
scheme is the increase in the memory bandwidth us- 
age. In the worst case, one L2 cache miss causes the 
entire hash hierarchy corresponding to the L2 block 
to be loaded from memory. 

Fortunately, the simulation results in Figure 5 in- 
dicate that caching works very well for the hash 
tree. Figure 5 (a) shows the average number of hash 
blocks loaded from the main memory on a L2 cache 
miss. Without caching the hashes, every L2 miss 
causes thirteen additional memory reads for this con- 
figuration as shown by the naive scheme. However, 
with caching, the number of additional memory reads 
is less than one for all benchmarks. As a result, 
the overhead of the memory bandwidth usage with 
caching is very small compared to the case without 
caching (Figure 5 (a)). 

For programs that have low bandwidth usage, the 
increase of the bandwidth usage due to memory au- 
thentication is not a problem since loading the hashes 
just uses extra bandwidth. In our simulations, the 
bandwidth pollution is a major problem only for mcf, 
applu, art, and swim even though accessing hashes 
increases the bandwidth usage for all benchmarks. 


5.5 Effects of Hash Parameters 


There are two architectural parameters in our mem- 
ory authentication scheme: the throughput of hash 
computation and the size of hash read/write buffers. 
This subsection studies the trade-offs in varying these 
parameters. 

The throughput of computing hashes varies de- 
pending on how the logic is pipelined. Obviously, 
higher throughput is better for the performance, but 
requires larger space to implement. Figure 6 shows 
the IPC of various applications using memory au- 
thentication with caching for varying hash through- 
put. 

As shown in the figure, having higher through- 
put than 3.2GB/s does not help at all. When the 
throughput lowers to 1.6GB/s, which is the same 
as memory bandwidth, we see minor performance 
degradation. If the hash throughput is lower than 
the memory bandwidth, it directly impacts and de- 
grades the performance. In our experiments, the IPC 
degraded as much as 50% for mcf, applu, art, and 
swim. This is because the effective memory band- 
width is limited by the hash computing throughput. 
Therefore, the hash throughput should be slightly 
higher than the memory bandwidth. 

Figure 7 studies the effect of the hash buffer size 
on the application performance (IPC). The hash read 
buffer holds a new L2 cache block while its hash gets 
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Figure 6: The effect of hash computation throughput 
on performance. The results are shown for a 1-MB 
cache with 64-B cache blocks. 6.4GB/s = one hash 
per 10 cycles. 
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Figure 7: The effect of hash buffer size on perfor- 
mance. The results are shown for a 1-MB cache with 
64-B cache blocks. 


computed and checked with the previously stored 
hash. Similarly, the hash write buffer holds an 
evicted L2 cache block until a new hash of the block 
is computed and stored back in the L2 cache. A 
larger buffer allows more memory transactions to be 
outstanding. However, given the fact that the hash 
computation throughput is higher than the memory 
bandwidth, the hash buffer size does not affect the 
performance. 


5.6 Reducing Memory Size Overhead 


With one hash (128 bits) covering a 64-B cache line, 
25% of main memory space is used to store hash val- 
ues. This memory overhead also implies that these 
hash values will contend for the L2 cache space and 
comsume the memory bandwidth, which can result in 
performance degradation. Therefore, reducing mem- 
ory overhead is essential to reduce the overall memory 
authentication overhead. 

The most straightforward way to reduce mem- 


GB chash @ naive 


GB chash @ naive 


g 14 » 16 
Do 

Fit J 3 EES a 

as Brae 

o 10 ,-------- --------4----------f------------- £ 

a 2 190 +--8----B----_----» ---g---- 38 ---- a - 

g oy ee. a 
< 3 4------3----@-... 3... ----B-.--@--.----- 

8 a oe en ee en i a i s oo oe oo oe fo oo oo tee ee 

© 4 -+---f-----[ ----#----f----- ----fB------------- 4 

aa N 4 4--4----B----@3----@-----B---- 8B ---- ---- 

Sy is 

Cen en en i fn fa lee £ 2 +--40-----}--)--} (l--- -- i --- 0 --- --— 

S 0 2 0 : 

gcc gzip mcf twolf vortex vpr applu art swim gcc gzip mcf twolf vortex vpr applu art swim 


(a) 


(b) 


Figure 5: Memory bandwidth usage for a standard processor, memory authentication with caching, without 
caching. The L2 cache is 1 MB with 64-B cache blocks. (a) The additional number of hash loads from 
memory per L2 cache miss. (b) Normalized memory bandwidth usage (normalized with base). 
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Figure 8: The performance of the mhash scheme with 
two cache blocks per chunk. The results are shown 
for a 1-MB cache. 


ory overhead is to increase the L2 cache block size. 
As shown in Section 5.4, having 128-B L2 blocks 
rather than 64-B significantly reduces the perfor- 
mance degradation compared to the base case. On 
the other hand, large block sizes often result in poor 
baseline performance due to poor cache performance. 

Another way to reduce the memory overhead is 
to make one hash cover multiple L2 cache blocks. 
However, in this case, all cache blocks covered by 
the same hash should be fetched to authenticate any 
one of them. Also, write back involves more memory 
operations. Therefore, the mhash scheme with 2 or 
more cache blocks per chunk tends to consume more 
bandwidth than the chash scheme. 

Figure 8 compares the performance of using one 
hash per 64-B L2 block (chash-64B), one hash per 
128-B L2 block (chash-128B), and one hash per two 
64-B L2 blocks (mhash-64B). In general, mhash-64B 


performs comparable to chash-64B. For benchmarks 
with high bandwidth usage, mhash-64B performs 
worse than chash-64B. For benchmarks sensitive to 
L2 cache contention, it performs better. Therefore, 
the right algorithm should be chosen based on the 
type of main applications. 


On the other hand, mhash-64B always outperforms 
chash-128B. Moreover, increasing the L2 cache block 
size can degrade performance even when the appli- 
cation do not use memory authentication scheme. 
Therefore, to reduce memory size overhead, it ap- 
pears that is it always better to make one hash 
cover multiple cache blocks rather than increasing the 
cache block size. 


Conclusion 


We have presented a memory authentication scheme 
that can be used to build high performance secure 
computing platforms out of slightly modified general- 
purpose processors. By integrating the hash tree ma- 
chinery with an on-chip (L2) cache, we arrived at a 
memory authentication scheme with reasonable over- 
heads. The evaluations we have carried out show, for 
instance, that for large L2 sizes, performance over- 
head is + 20%, area overhead is ~ 10,000 gates, and 
12.5 to 25% of untrusted external memory is used up 
by hashes. 


Ongoing work includes the investigation of off-line 
memory authentication schemes, and the generaliza- 
tion of authentication schemes to SMP systems. 
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